Distance Preserving Graph Simplification
Large graphs are difficult to represent, visualize, and understand. In this
paper, we introduce the "gate graph", a new approach to graph
simplification. A gate graph provides a simplified topological view of the
original graph. Specifically, we construct a gate graph from a large graph so
that for any "non-local" vertex pair (distance higher than some threshold) in
the original graph, their shortest-path distance can be recovered by
consecutive "local" walks through the gate vertices in the gate graph. We
perform a theoretical investigation on the gate-vertex set discovery problem.
We characterize its computational complexity and derive an upper bound on the
size of the minimum gate-vertex set using VC-dimension theory. We propose an efficient
mining algorithm to discover a gate-vertex set with guaranteed logarithmic
bound. We further present a fast technique for pruning redundant edges in a
gate graph. The detailed experimental results using both real and synthetic
graphs demonstrate the effectiveness and efficiency of our approach.
Comment: A short version of this paper will be published at ICDM'11, December
2011.
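The distance-recovery property described in this abstract can be illustrated with a toy sketch (hypothetical code, not the paper's algorithm: the graph, the threshold t, and the gate set below are made up for illustration, and discovering a good gate-vertex set is the paper's actual contribution):

```python
from collections import deque
import heapq

def bfs_dist(adj, src):
    """Exact unweighted shortest-path distances from src."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def recover_distance(adj, gates, u, v, t):
    """Recover d(u, v) for a non-local pair (d > t) by chaining 'local'
    hops (length <= t) between u, the gate vertices, and v."""
    nodes = {u, v} | set(gates)
    dist = {x: bfs_dist(adj, x) for x in nodes}
    # Dijkstra over the anchor nodes, using only local (<= t) hops
    best = {x: float("inf") for x in nodes}
    best[u] = 0
    heap = [(0, u)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > best[x]:
            continue
        for y in nodes:
            w = dist[x].get(y, float("inf"))
            if y != x and w <= t and d + w < best[y]:
                best[y] = d + w
                heapq.heappush(heap, (d + w, y))
    return best[v]
```

On a path graph 0-1-2-3-4-5 with threshold t=2 and gates {2, 4}, the non-local pair (0, 5) has its exact distance 5 recovered by the local hops 0→2→4→5.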
A Deep Embedding Model for Co-occurrence Learning
Co-occurrence data is a common and important information source in many
areas, such as word co-occurrence in sentences, friend co-occurrence in
social networks, and product co-occurrence in commercial transaction data;
such data contains rich correlation and clustering information about the
items. In this paper, we study co-occurrence data using a general energy-based
probabilistic model, and we analyze three categories of energy-based
models, which are able to capture different levels of dependency in the
co-occurrence data. We also discuss how several typical existing models are
related to these three types of energy models, including the Fully Visible
Boltzmann Machine (FVBM), Matrix Factorization, the Log-BiLinear (LBL) model,
and the Restricted Boltzmann Machine (RBM). Then, we propose a Deep Embedding
Model (DEM) derived from the energy model in a \emph{principled} manner.
Furthermore, motivated by the observation that the partition function in the
energy model is intractable and the fact that the major objective of modeling
the co-occurrence data is to predict using the conditional probability, we
apply the \emph{maximum pseudo-likelihood} method to learn DEM. In consequence,
the developed model and its learning method naturally avoid the above
difficulties and can be easily used to compute the conditional probability in
prediction. Interestingly, our method is equivalent to learning a special
structured deep neural network using back-propagation and a special sampling
strategy, which makes it scalable on large-scale datasets. Finally, in the
experiments, we show that the DEM can achieve comparable or better results than
state-of-the-art methods on datasets across several application domains.
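As a concrete illustration of the maximum pseudo-likelihood idea for one of the models mentioned above, here is a hypothetical sketch for a Fully Visible Boltzmann Machine over binary co-occurrence vectors (the data, learning rate, and update scheme are made up for illustration; this is not the DEM architecture itself): each conditional P(x_i | x_{-i}) is a logistic regression, so the pseudo-likelihood and its gradient are tractable even though the partition function is not.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pseudo_loglik(x, W, b):
    """sum_i log P(x_i | x_{-i}) for a fully visible Boltzmann machine
    with symmetric weights W (zero diagonal) and biases b."""
    total = 0.0
    n = len(x)
    for i in range(n):
        z = b[i] + sum(W[i][j] * x[j] for j in range(n) if j != i)
        p = sigmoid(z)
        total += math.log(p if x[i] == 1 else 1.0 - p)
    return total

def mpl_step(x, W, b, lr=0.05):
    """One gradient-ascent step on the pseudo-log-likelihood."""
    n = len(x)
    gW = [[0.0] * n for _ in range(n)]
    gb = [0.0] * n
    for i in range(n):
        z = b[i] + sum(W[i][j] * x[j] for j in range(n) if j != i)
        err = x[i] - sigmoid(z)   # d/dz of log P(x_i | x_{-i})
        gb[i] += err
        for j in range(n):
            if j != i:
                gW[i][j] += err * x[j]
    # symmetric update so W stays symmetric with zero diagonal
    for i in range(n):
        b[i] += lr * gb[i]
        for j in range(n):
            if i != j:
                W[i][j] += lr * (gW[i][j] + gW[j][i]) / 2.0
    return W, b
```

Each gradient step provably increases the pseudo-log-likelihood for a small enough learning rate, with no partition-function estimate ever needed.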
Compiler and runtime support for shared memory parallelization of data mining algorithms
Data mining techniques focus on finding novel and useful patterns or models from large datasets. Because of the volume of the data to be analyzed, the amount of computation involved, and the need for rapid or even interactive analysis, data mining applications require the use of parallel machines. We have been developing compiler and runtime support for developing scalable implementations of data mining algorithms. Our work encompasses shared memory parallelization, distributed memory parallelization, and optimizations for processing disk-resident datasets. In this paper, we focus on compiler and runtime support for shared memory parallelization of data mining algorithms. We have developed a set of parallelization techniques that apply across algorithms for a variety of mining tasks. We describe the interface of the middleware where these techniques are implemented. Then, we present compiler techniques for translating data parallel code to the middleware specification. Finally, we present a brief evaluation of our compiler using Apriori association mining and k-means clustering.
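The per-thread reduction idea behind such shared memory parallelization can be sketched as follows (a hypothetical Python illustration of the technique, not the paper's middleware, which targets compiled code on real shared-memory machines): each thread accumulates into a private copy of the reduction object (per-cluster sums and counts), and the private copies are merged only at the end, so no synchronization is needed during the scan.

```python
from concurrent.futures import ThreadPoolExecutor

def assign_and_reduce(chunk, centroids):
    """Thread-private reduction object for one chunk of 1-D points:
    per-cluster sums and counts, updated without any locking."""
    k = len(centroids)
    sums, counts = [0.0] * k, [0] * k
    for x in chunk:
        j = min(range(k), key=lambda c: (x - centroids[c]) ** 2)
        sums[j] += x
        counts[j] += 1
    return sums, counts

def parallel_kmeans_step(data, centroids, n_threads=4):
    """One k-means iteration: scan chunks in parallel, then merge the
    private reduction objects and recompute the centroids."""
    chunks = [data[i::n_threads] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as ex:
        parts = list(ex.map(lambda ch: assign_and_reduce(ch, centroids), chunks))
    k = len(centroids)
    sums = [sum(p[0][j] for p in parts) for j in range(k)]
    counts = [sum(p[1][j] for p in parts) for j in range(k)]
    return [sums[j] / counts[j] if counts[j] else centroids[j] for j in range(k)]
```

The merge is associative and commutative, which is exactly what lets the same pattern cover Apriori candidate counting as well as k-means.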
On Estimating Recommendation Evaluation Metrics under Sampling
Since the recent study by Krichene and Rendle (2020) on the sampling-based
top-k evaluation metrics for recommendation, there has been much debate on
the validity of using sampling to evaluate recommendation algorithms. Though
their work and the recent work of Li et al. (2020) have proposed some basic
approaches for mapping the sampling-based
metrics to their global counterparts which rank the entire set of items, there
is still a lack of understanding and consensus on how sampling should be used
for recommendation evaluation. The proposed approaches are either rather
uninformative in linking sampling to metric evaluation or only work on
simple metrics such as Recall/Precision (Krichene and Rendle 2020; Li et al.
2020). In this paper, we introduce a new research problem on learning the
empirical rank distribution, and a new approach based on the estimated rank
distribution, to estimate the top-k metrics. Since this question is closely
related to the underlying mechanism of sampling for recommendation, tackling it
can help better understand the power of sampling and can help resolve the
questions of whether and how we should use sampling for recommendation evaluation.
We introduce two approaches, based on MLE (Maximum Likelihood Estimation) and
its weighted variants and on the ME (Maximum Entropy) principle, to recover the
empirical rank distribution, and then utilize them for metric estimation. The
experimental results show the advantages of using the new approaches for
evaluating recommendation algorithms based on top-k metrics.
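The link between sampled and global ranks that such estimators exploit can be sketched as follows (a hypothetical illustration; the paper's actual MLE/ME estimators and their weighting schemes are more involved): if the target item has global rank R among N items and is ranked against n-1 uniformly sampled negatives, its sampled rank r is binomial, so an empirical rank distribution can be fit to observed sampled-rank frequencies by EM, and top-k metrics read off from it.

```python
import math

def binom_pmf(r, n, p):
    """P(sampled rank = r) when each of n-1 negatives outranks w.p. p."""
    return math.comb(n - 1, r - 1) * p ** (r - 1) * (1 - p) ** (n - r)

def sampling_matrix(N, n):
    """A[r-1][R-1] = P(sampled rank r | global rank R): the target plus
    n-1 uniform negatives, each outranking it with prob (R-1)/(N-1)."""
    A = [[0.0] * N for _ in range(n)]
    for R in range(1, N + 1):
        p = (R - 1) / (N - 1)
        for r in range(1, n + 1):
            A[r - 1][R - 1] = binom_pmf(r, n, p)
    return A

def em_rank_distribution(obs, A, iters=300):
    """Fit an empirical global-rank distribution theta to the observed
    sampled-rank frequencies obs by EM (multinomial mixture weights)."""
    n, N = len(A), len(A[0])
    theta = [1.0 / N] * N
    for _ in range(iters):
        new = [0.0] * N
        for r in range(n):
            denom = sum(theta[R] * A[r][R] for R in range(N))
            if denom <= 0.0:
                continue
            for R in range(N):
                new[R] += obs[r] * theta[R] * A[r][R] / denom
        total = sum(new)
        theta = [v / total for v in new]
    return theta

def estimated_recall_at_k(theta, k):
    """Top-k Recall (hit rate) read off the estimated rank distribution."""
    return sum(theta[:k])
```

Summing the first k entries of the fitted theta estimates the global Recall@k; with few sampled items the fit is underdetermined, which is exactly why the weighted MLE and maximum-entropy variants discussed in the abstract matter.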
Towards Reliable Item Sampling for Recommendation Evaluation
Since Rendle and Krichene argued that commonly used sampling-based evaluation
metrics are "inconsistent" with respect to the global metrics (even in
expectation), there have been a few studies on the sampling-based recommender
system evaluation. Existing methods try either mapping the sampling-based
metrics to their global counterparts or more generally, learning the empirical
rank distribution to estimate the top-k metrics. However, despite existing
efforts, there is still a lack of rigorous theoretical understanding of the
proposed metric estimators, and basic item sampling also suffers from the
"blind spot" issue, i.e., the estimation error in recovering the top-k metrics
when k is small can still be rather substantial. In this paper, we provide an
in-depth investigation into these problems and make two innovative
contributions. First, we propose a new item-sampling estimator that explicitly
optimizes the error with respect to the ground truth, and theoretically
highlight its subtle difference against prior work. Second, we propose a new
adaptive sampling method that aims to address the "blind spot" problem, and we
demonstrate that the expectation-maximization (EM) algorithm can be generalized
for such a setting. Our experimental results confirm our statistical analysis
and the superiority of the proposed methods. This study helps lay the theoretical
foundation for adopting item sampling metrics for recommendation evaluation,
and provides strong evidence towards making item sampling a powerful and
reliable tool for recommendation evaluation.
Comment: aaai202
Zone-based Federated Learning for Mobile Sensing Data
Mobile apps, such as mHealth and wellness applications, can benefit from deep
learning (DL) models trained with mobile sensing data collected by smartphones
or wearable devices. However, currently there is no mobile sensing DL system
that simultaneously achieves good model accuracy while adapting to user
mobility behavior, scales well as the number of users increases, and protects
user data privacy. We propose Zone-based Federated Learning (ZoneFL) to address
these requirements. ZoneFL divides the physical space into geographical zones
mapped to a mobile-edge-cloud system architecture for good model accuracy and
scalability. Each zone has a federated training model, called a zone model,
which adapts well to the data and behaviors of users in that zone. Benefiting
from the FL design, user data privacy is protected during ZoneFL training.
We propose two novel zone-based federated training algorithms to optimize zone
models to user mobility behavior: Zone Merge and Split (ZMS) and Zone Gradient
Diffusion (ZGD). ZMS optimizes zone models by adapting the zone geographical
partitions through merging of neighboring zones or splitting of large zones
into smaller ones. Different from ZMS, ZGD maintains fixed zones and optimizes
a zone model by incorporating the gradients derived from neighboring zones'
data. ZGD uses a self-attention mechanism to dynamically control the impact of
one zone on its neighbors. Extensive analysis and experimental results
demonstrate that ZoneFL significantly outperforms traditional FL in two models
for heart rate prediction and human activity recognition. In addition, we
developed a ZoneFL system using Android phones and the AWS cloud. The system
was used in a heart rate prediction field study with 63 users for 4 months,
demonstrating the feasibility of ZoneFL in real life.
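The self-attention idea in ZGD can be sketched as follows (a hypothetical illustration; the abstract does not specify the actual attention parameterization, so the dot-product scoring and the additive combination here are assumptions): each neighboring zone's gradient contribution is weighted by a softmax over its alignment with the zone's own gradient.

```python
import math

def zgd_update(zone_grad, neighbor_grads):
    """One Zone Gradient Diffusion step: combine a zone's own gradient with
    its neighbors' gradients, weighted by softmax dot-product attention.
    Assumes at least one neighbor."""
    # attention score: alignment of each neighbor's gradient with our own
    scores = [sum(a * b for a, b in zip(zone_grad, g)) for g in neighbor_grads]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # diffused gradient: own gradient plus attention-weighted neighbor mix
    return [zg + sum(weights[i] * neighbor_grads[i][d]
                     for i in range(len(neighbor_grads)))
            for d, zg in enumerate(zone_grad)]
```

Under this scheme a neighbor whose data pulls the model in the same direction gets more weight, while a misaligned neighbor's influence is damped, which matches the abstract's description of dynamically controlling the impact of one zone on its neighbors.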